Analyzing the Game of Survivor -- Analyzing Sentiment of Reddit Posts (3)¶
Welcome back to the third installment of the Survivor analysis series! My last post began to investigate the [relationship between Reddit mentions and particular contestants]. The next step is to dive a bit deeper into the Reddit comments themselves, using some basic NLP techniques.
Here, I will continue digging into some of the Reddit data that I collected via the Pushshift.io API. For more information on how this data was collected, please check out the [first article in this series], where I describe the ETL process for the Pushshift (as well as other!) data.
Introduction to NLP and Sentiment Analysis¶
We will be using a query similar to the one in the first post, so I will skip the explanation here.
A quick explanation of NLP -- and a disclaimer: I have done some work in this field, but am by no means an expert. The next few paragraphs give a brief overview of some of the topics in this area; you are encouraged to dive deeper yourself. The following isn't necessary to understand this analysis, but I wanted to give a quick overview of some of the challenges in the field and some of the potential shortcomings of this analysis before I dive in.
Sentiment analysis is a branch of Natural Language Processing that investigates the sentiment of particular words and sentences. There is quite a lot of research on this topic, but essentially an annotated collection of sentences is used to train a model that predicts the sentiment of a sentence from attributes of the text itself.
Extracting these attributes is a topic in and of itself, and can be done in quite a few ways. The simplest, and most widely known, is a bag of words approach. In this approach, the text is first tokenized into words, and each document is then vectorized based on the counts of each word across the vocabulary. This assumes, essentially, that everything interesting about a sentence can be broken down into its components (words). That is, the whole is the sum of the parts.
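As a minimal illustration of the idea (using only the standard library -- the `tokenize` and `bag_of_words` helpers here are my own toy sketch, with a naive whitespace tokenizer standing in for real tokenization):

```python
from collections import Counter

def tokenize(text):
    # Naive tokenizer: lowercase, split on whitespace, strip edge punctuation.
    return [w.strip('.,!?').lower() for w in text.split()]

def bag_of_words(docs):
    # Build a shared vocabulary, then represent each document
    # as a vector of word counts over that vocabulary.
    vocab = sorted({w for d in docs for w in tokenize(d)})
    vectors = []
    for d in docs:
        counts = Counter(tokenize(d))
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

docs = ["I love Survivor.", "I love love immunity idols!"]
vocab, vectors = bag_of_words(docs)
# vocab: ['i', 'idols', 'immunity', 'love', 'survivor']
# vectors: [[1, 0, 0, 1, 1], [1, 1, 1, 2, 0]]
```

Notice that word order is completely discarded -- each document is reduced to counts, which is exactly the "whole is the sum of the parts" assumption.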
However, we know this isn't the case. Words have contextual meaning as well, and there are correlations with the other words in a sentence. In this space, there are contextual encoders that try to handle this issue, like ELMo and BERT. That is a topic for another time, however.
NLP is a rich field with many topics of interest. For this analysis, we will mostly be glossing over these details and using pretrained models for sentiment analysis.
Oftentimes in these kinds of tasks, if the problem is generalizable enough, a pretrained model built on a general corpus (like Wikipedia, or reviews across industries) is sufficient for your use-case. Of course, there may be domain-specific meaning to particular words or phrases. For instance, the sentence:
I hate Russell.
is semantically very similar to:
I hate Frosted Flakes.
but
Tony got the third immunity idol.
has a Survivor-specific meaning in terms of the word "immunity". In other contexts, like:
I am immune to the Chicken Pox, since I've already had it.
the word immunity has an essentially different meaning. In fact, outside of a Survivor context, the word "immunity" will almost never mean exactly the same thing it does on the show. This is important to recognize, as it shows us the limitations of using a generalized model.
For this reason, there is often a desire to fine-tune these generalized models to a specific use-case.
However, the word fine-tune is quite intentional. The generalized model (usually a neural network, in the case of embeddings) handles the bulk of the work -- the fine-tuning has a smaller, incremental benefit compared to the generalizable model. The general model gets you most of the way there.
While the topics above (contextual embeddings and fine-tuning models for domain-specific applications) are certainly of interest, and could potentially improve the models, I will leave them (and all model fitting) for a later date. For now, we will just use publicly available, out-of-the-box solutions to get a quick view of the sentiment. Then, if the results seem worth digging deeper into, I will investigate building contextual models in future articles to gain more insight into the semantics and context of the sentences.
import os
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
import statsmodels.api as sm
from IPython.display import HTML, display
from plotly.express import scatter
pg_un, pg_pw, pg_ip, pg_port = [os.getenv(x) for x in ['PG_UN', 'PG_PW', 'PG_IP', 'PG_PORT']]
def pg_uri(un, pw, ip, port):
    return f'postgresql://{un}:{pw}@{ip}:{port}'
eng = create_engine(pg_uri(pg_un, pg_pw, pg_ip, pg_port))
sql = '''
WITH contestants_to_seasons AS (
SELECT c.contestant_id, c.first_name,
c.last_name, cs.contestant_season_id, c.sex,
cs.season_id, occupation, location, age, placement,
days_lasted, votes_against,
med_evac, quit, individual_wins, attempt_number,
tribe_0, tribe_1, tribe_2, tribe_3, alliance_0,
alliance_1, alliance_2,
challenge_wins, challenge_appearances, sitout,
voted_for_bootee, votes_against_player, character_id,
r.role, r.description,
total_number_of_votes_in_episode, tribal_council_appearances,
votes_at_council, number_of_jury_votes, total_number_of_jury_votes,
number_of_days_spent_in_episode, days_in_exile,
individual_reward_challenge_appearances, individual_reward_challenge_wins,
individual_immunity_challenge_appearances, individual_immunity_challenge_wins,
tribal_reward_challenge_appearances, tribal_reward_challenge_wins,
tribal_immunity_challenge_appearances, tribal_immunity_challenge_wins,
tribal_reward_challenge_second_of_three_place, tribal_immunity_challenge_second_of_three_place,
fire_immunity_challenge, tribal_immunity_challenge_third_place, episode_id
FROM survivor.contestant c
RIGHT JOIN survivor.contestant_season cs
ON c.contestant_id = cs.contestant_id
JOIN survivor.episode_performance_stats eps
ON eps.contestant_id = cs.contestant_season_id
JOIN survivor.role r
ON cs.character_id = r.role_id
), matched_exact AS
(
SELECT reddit.*, c.*
FROM survivor.reddit_comments reddit
JOIN contestants_to_seasons c
ON (POSITION(c.first_name IN reddit.body) > 0
OR POSITION(c.last_name IN reddit.body) > 0)
AND c.season_id = reddit.within_season
AND c.episode_id = reddit.most_recent_episode
WHERE within_season IS NOT NULL
)
SELECT *
FROM matched_exact m
'''
reddit_df = pd.read_sql(sql, eng)
ep_df = pd.read_sql('SELECT * FROM survivor.episode', eng)
season_to_name = pd.read_sql('SELECT season_id, name AS season_name FROM survivor.season', eng)
reddit_df = reddit_df.merge(season_to_name, on='season_id')
reddit_df.rename(columns={'name': 'season_name'}, inplace=True)
reddit_df = reddit_df.merge(ep_df.drop(columns=['season_id']), on='episode_id')
reddit_df['created_dt'] = pd.to_datetime(reddit_df['created_dt'])
pd.options.display.max_columns = 100
TextBlob¶
To analyze the sentiment of the comments, we will be using textblob, an NLP package that builds on top of NLTK and pattern, two very popular NLP libraries. It has a very simple API that will let us do a lot of interesting analysis right out of the box!
We will be using their pre-trained sentiment analysis to extract the sentiment from each comment.
To give a sense of what this looks like, let's look at a dummy example:
from textblob import TextBlob
text = 'I love Katy Perry.'
blob = TextBlob(text)
blob.sentiment
blob.noun_phrases
The sentiment gives us a polarity, which represents the actual sentiment (1 being most positive, -1 most negative), and a subjectivity, which tries to measure how subjective the sentence itself is (0 being "objective", 1 being subjective).
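To build some intuition for the polarity scale, here is a deliberately toy, lexicon-based scorer. To be clear, this is not how textblob actually computes its scores (its lexicon, inherited from pattern, also handles negation, intensifiers, and more) -- the `LEXICON` and `toy_polarity` below are purely illustrative:

```python
# Hypothetical mini-lexicon mapping words to polarity scores in [-1, 1].
LEXICON = {'love': 0.5, 'great': 0.8, 'hate': -0.8, 'idiot': -0.8}

def toy_polarity(text):
    # Average the polarity of every lexicon word found; 0.0 if none match.
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

toy_polarity('I love this great season')  # positive (0.65)
toy_polarity('no opinion here')           # 0.0
```

Note how text with no sentiment-bearing words falls back to exactly 0.0 -- this matters later, when we see how many comments score a polarity of zero.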
We will take a look at both in this analysis.
Let's look at a particular comment and see what we will want to look at.
idx_two_example = reddit_df[reddit_df['body'].str.contains('Tyler made the absolutely best strategic move he could have. Will... is an idiot.')].index[0]
example_comment = reddit_df['body'].iloc[idx_two_example]
example_comment
The first thing we can do is just look at the sentiment of this entire paragraph. That's straightforward enough:
b = TextBlob(example_comment)
TextBlob(example_comment).sentiment
But there are a few issues with this. First off, we notice there are two different names in the above comment, with different sentiments towards each. How do we associate the sentiment with the correct person?
This is a hard problem. Essentially, we would have to find a way to match each noun to a verb or adjective. This isn't a trivial task, and assumes a lot -- that we can identify the part of speech (not a given at all), and then unambiguously match them. In short, this is not easy to do and, as far as I can see, there's no way to handle this out of the box with TextBlob.
There is, however, a somewhat "close enough" way to handle this problem. For now, since we're just looking at things on aggregate, this is what we will do.
First, we can extract all of the relevant contestants from the string. Remember, each comment is replicated for each possible subject, if there are multiple names. In this case, we have:
example_subject = reddit_df['first_name'].iloc[idx_two_example]
example_subject
It's Will. Looking at the above sentences, it looks like this person doesn't like Will too much, and thinks Tyler made the "absolutely best strategic move". So let's first extract only the sentences that have Will's name in them, and look at the average sentiment of those sentences.
[x.sentiment for x in TextBlob(example_comment).sentences]
np.mean([x.sentiment.polarity for x in TextBlob(example_comment).sentences if example_subject in x])
Let's compare this with the other potential subject, on the next row:
example_subject = reddit_df['first_name'].iloc[idx_two_example + 1]
print(example_subject)
np.mean([x.sentiment.polarity for x in TextBlob(example_comment).sentences if example_subject in x])
We see that this matches our expectations much better than just using the overall polarity.
Note that this method is far from perfect, but short of being able to associate each potential subject-adjective pair, it will perform well enough for our purposes.
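Distilled, this "close enough" approach looks something like the sketch below. Here `polarity_fn` is a stand-in for any sentence-level scorer (TextBlob or otherwise), and `fake_scorer` is a dummy just to demonstrate the mechanics:

```python
def subject_polarity(comment, subject, polarity_fn):
    # Split naively on periods (TextBlob's sentence segmentation is smarter),
    # keep only sentences mentioning the subject, and average their polarity.
    sentences = [s.strip() for s in comment.split('.') if s.strip()]
    scores = [polarity_fn(s) for s in sentences if subject in s]
    return sum(scores) / len(scores) if scores else None

# A fake scorer purely for demonstration.
fake_scorer = lambda s: 0.9 if 'best' in s else -0.9
comment = 'Tyler made the best move. Will is an idiot.'
subject_polarity(comment, 'Tyler', fake_scorer)  # 0.9
subject_polarity(comment, 'Will', fake_scorer)   # -0.9
```

Each subject only "hears" the sentences that name it, which is exactly what we did manually above.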
To give an example of how difficult this problem can be, take a look at this comment:
idx_many_example = reddit_df[reddit_df['body'].str.contains('I am really wishing for a David win, partly because')].index[0]
reddit_df['body'].iloc[idx_many_example]
Look how many names we have there! By my count there are:
- David
- Tony
- Mike
- Kristie
- Zeke
- Yul
- Jonathan
- Jessica
- Jay
- Jeff (Probst!)
- Cochran
- Michelle
Somehow, this person managed to jam that many names into so few words! (Impressive!)
In this case, using the above method doesn't work so well.
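To see why: when several names share a single sentence, the sentence-filter approach hands each of them the exact same polarity score. A quick way to spot such ambiguous sentences (again with naive period-splitting, and an illustrative names list):

```python
def ambiguous_sentences(comment, names):
    # Return sentences mentioning more than one contestant: these get the
    # same polarity attributed to every name they contain.
    sentences = [s.strip() for s in comment.split('.') if s.strip()]
    return [s for s in sentences
            if sum(name in s for name in names) > 1]

comment = 'David outplayed Zeke and Jay. Kristie deserved the win.'
ambiguous_sentences(comment, ['David', 'Zeke', 'Jay', 'Kristie'])
# ['David outplayed Zeke and Jay']
```

In a comment that crams a dozen names into a few sentences, nearly every sentence ends up ambiguous, and the per-subject averages all collapse toward the same number.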
We'll stop here for now, but there could be some interesting avenues to explore in associating subjects/objects with verbs and adjectives.
def extract_overall_and_sentence_sentiment(row):
    body = row['body']
    first = row['first_name']
    last = row['last_name']
    b = TextBlob(body)
    overall_sentiment = b.sentiment
    overall_polarity = overall_sentiment.polarity
    overall_subj = overall_sentiment.subjectivity
    # Only keep sentences that mention this row's contestant by name.
    rel_sentences_sentiment = [x.sentiment
                               for x in b.sentences
                               if (first in x) or (last in x)]
    sentence_polarity = np.mean([y.polarity for y in rel_sentences_sentiment])
    sentence_subj = np.mean([y.subjectivity for y in rel_sentences_sentiment])
    metrics = [overall_polarity, overall_subj, sentence_polarity, sentence_subj]
    metric_names = ['overall_polarity', 'overall_subj', 'sentence_polarity', 'sentence_subj']
    return pd.Series(metrics, index=metric_names)

def add_sentiment_cols(df):
    sentiment_df = df.apply(extract_overall_and_sentence_sentiment, axis=1)
    return pd.concat([df, sentiment_df], axis=1)

reddit_df = add_sentiment_cols(reddit_df)
Let's take a look at one of the comments to get a sense of what this looks like!
reddit_df.sample(1)[['body', 'first_name', 'last_name',
'overall_subj', 'sentence_subj',
'overall_polarity', 'sentence_polarity']].to_dict()
It's important to remember that there seems to be a lot of discussion about which contestants could win or should win, or whether they made the right decision. This is, of course, a separate question from sentiment -- there are plenty of people who make good strategic moves but are not the most well liked, by fans or by other contestants. Strong words in either direction ("hate", "love", etc.) will drive the polarity scores much more than this kind of strategic talk.
Out of curiosity, let's take a look at the overall distribution of sentiment in these comments.
from plotly.express import bar, histogram, box
histogram(data_frame=reddit_df, x='sentence_polarity', nbins=50, title='Sentence Sentiment Polarity')
histogram(data_frame=reddit_df, x='overall_polarity', nbins=50, title='Overall Sentiment Polarity')
The first thing that stands out is that there are a large number of 0 sentiments for both the sentence and overall polarity.
(reddit_df['sentence_polarity'] == 0).mean()
(reddit_df['overall_polarity'] == 0).mean()
To only look at cases where there is some sentiment, let's look at a histogram of non-zero values:
histogram(data_frame=reddit_df[reddit_df['sentence_polarity'] != 0],
x='sentence_polarity', nbins=50,
title='Non-Zero Sentence Sentiment Polarity',
)
histogram(data_frame=reddit_df[reddit_df['overall_polarity'] != 0],
x='overall_polarity', nbins=50,
title='Non-Zero Overall Sentiment Polarity',
)
We see that in both cases, most comments are positive (the distribution is skewed left, with most of its mass above zero). This is interesting -- it cuts against the claim that redditors are inherently negative! To break this down a bit more, let's take a look at the different seasons:
box(data_frame=reddit_df[reddit_df['sentence_polarity'] != 0],
x='sentence_polarity', color='season_name',
title='Non-Zero Sentence Sentiment Polarity by Season',
)
Box plots by each season don't immediately reveal too much -- it appears that the distributions are somewhat similar for all of the seasons.
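One way to go beyond eyeballing the box plots is to compare each season's median and interquartile range directly. A stdlib sketch on toy numbers (`summarize` and the `toy` data are illustrative -- the real version would group `reddit_df['sentence_polarity']` by `season_name`):

```python
from statistics import median, quantiles

def summarize(groups):
    # For each group, report the median and the IQR (Q3 - Q1),
    # i.e. the box plot's center line and box width.
    out = {}
    for name, values in groups.items():
        q1, q2, q3 = quantiles(values, n=4)
        out[name] = {'median': median(values), 'iqr': q3 - q1}
    return out

toy = {'Season A': [-0.2, 0.1, 0.3, 0.4, 0.6],
       'Season B': [-0.5, -0.1, 0.2, 0.5, 0.8]}
summary = summarize(toy)
```

If the medians and IQRs come out close across seasons, that backs up the visual impression that the distributions are similar.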
Next, we will look at some of the individual contestant episode combinations to see which have the highest (and lowest) skewing distributions.
reddit_df['contestant_episode'] = reddit_df['first_name'] + ' ' + reddit_df['last_name'] + ' ' + reddit_df['episode_name']
def plot_grouped_sentiment_extremes(df, extreme_group='contestant_episode',
                                    plot_group='contestant_episode',
                                    sentiment_col='sentence_polarity',
                                    lower=.001, upper=.999, size_thresh=10,
                                    include_zeros=True, *args, **kwargs):
    if not include_zeros:
        df = df[df[sentiment_col] != 0].reset_index(drop=True)
    df = df[df[plot_group].notnull()].reset_index(drop=True)
    # Drop groups with too few comments to have a meaningful distribution.
    df = df[df.groupby(extreme_group)[sentiment_col].transform('size') > size_thresh].reset_index(drop=True)
    lower_quant, upper_quant = df.groupby(extreme_group)[sentiment_col].mean().quantile([lower, upper])

    def include_bool(x):
        return (x.mean() <= lower_quant) or (x.mean() >= upper_quant)

    # Keep only the groups whose mean sentiment sits in the extreme tails.
    best_worst_episodes_bool = df.groupby(extreme_group)[sentiment_col].transform(include_bool)
    df['order_by'] = df.groupby(plot_group)[sentiment_col].transform('median')
    plot_df = df[best_worst_episodes_bool].sort_values('order_by').reset_index()
    bx = box(data_frame=plot_df,
             x=sentiment_col, color=plot_group,
             category_orders={plot_group: plot_df[plot_group].unique().tolist()},
             *args, **kwargs)
    return bx
def create_story_contestant_episode(contestant_episode, init_text=''):
    example = reddit_df[reddit_df['contestant_episode'] == contestant_episode].iloc[0]
    try:
        sentences = TextBlob(example['story']).sentences
    except TypeError:
        return
    display(HTML(f'Results from <i>Story</i> aspect of the Wiki for {contestant_episode}...'))
    relevant = [s for s in sentences if example['first_name'] in s or example['last_name'] in s]
    full_story = init_text + '<br><br><br>' if init_text else ''
    for r in relevant:
        story = str(r) + '<br>'
        emphasized = bolden(story, example['first_name'])
        emphasized = bolden(emphasized, example['last_name'])
        full_story += emphasized
    full_story = force_breaks(full_story)
    return full_story

def display_with_breaks(text, every=120, split_on_words=False):
    display(HTML(force_breaks(text, every=every, split_on_words=split_on_words)))

def bolden(full_string, target):
    # Wrap every occurrence of `target` in <b> tags.
    return full_string.replace(target, '<b>' + target + '</b>')

def highlight_comment(row, sentiment_col='sentence_polarity'):
    body = row['body']
    sent = row[sentiment_col]
    for n in row[['first_name', 'last_name']]:
        body = bolden(body, n)
    ret_str = f'<b>Comment</b>: {body} <br><br> <b>Sentiment</b>: {sent}<br><br>'
    return ret_str

def get_example_comments(contestant_episode, polarity_var='sentence_polarity', n=1, polarity=-1):
    subset = reddit_df[reddit_df['contestant_episode'] == contestant_episode]
    # Sort ascending for negative examples, descending for positive ones.
    subset = subset.sort_values(by=polarity_var, ascending=(polarity <= 0))
    s_top = subset.iloc[0:n]
    s_top.apply(lambda x: display_with_breaks(highlight_comment(x, polarity_var)), axis=1)
plot_grouped_sentiment_extremes(reddit_df, lower=.005, upper=.995, include_zeros=False)
Interestingly, we see some of the most and least popular players/episodes in this graph.
It appears that the players who were liked (for instance, Malcolm in Kill or Be Killed and John Cody in Blood Is Thicker than Anything) seem to have a smaller variance -- meaning they are more unanimously liked.
Others who were disliked (like Colton Cumbie in One World Is Out the Window and Ozzy in Double Agent) had a larger range between the 1st and 3rd quartiles, indicating that sentiment towards them is more divided among redditors.
Interestingly, even some of the least "liked" contestant episode combinations (like Sophie Clarke in Cult Like) are above zero, which reflects the average positive sentiment for most of these comments.
Disliked Player -- Examples¶
"Worst" things people said about Ozzy:¶
get_example_comments('Ozzy Lusth Double Agent', n=4)
"Best" things people said about Ozzy¶
get_example_comments('Ozzy Lusth Double Agent', n=4, polarity=1)
"Worst" Things People Said About Colton¶
get_example_comments('Colton Cumbie One World Is Out the Window', n=4)
"Best" things people said about Colton¶
get_example_comments('Colton Cumbie One World Is Out the Window', n=4, polarity=1)
Liked Player -- Examples¶
"Best" things people said about Malcolm:¶
get_example_comments('Malcolm Freberg Kill or Be Killed', n=4, polarity=1)
"Worst" things people said about Malcolm:¶
get_example_comments('Malcolm Freberg Kill or Be Killed', n=4, polarity=-1)
"Best" things people said about John Cody:¶
get_example_comments('John Cody Blood Is Thicker than Anything', n=4, polarity=1)
"Worst" things people said about John Cody:¶
get_example_comments('John Cody Blood Is Thicker than Anything', n=4, polarity=-1)
plot_grouped_sentiment_extremes(reddit_df, sentiment_col='overall_polarity',
lower=.005, upper=.995, include_zeros=False)
For the overall sentiment, it seems we are getting some leakage from the other contestants mentioned in the same comments. This is still interesting in some ways -- for instance, if you compare Colton Cumbie's distribution in One World Is Out the Window between the two graphs, you can see that Colton had a wider distribution for the sentence polarity than for the overall polarity.
This seems to indicate that the sentences mentioning Colton directly are more divided than the comments as a whole -- which points to how controversial he is. Sentences concerning Colton also have a slightly lower median (-0.125 vs. -0.0625).
Additionally, and somewhat subjectively, we notice a few contestants who were associated or allied with players who were liked (or disliked) more than they themselves were noticed or liked -- for instance, Ken McNickle with David and Woo Hwang with Tony, in their respective seasons.
You can see below that there are some relationships between the subject of a comment and other characters in the game. For instance, Caleb Bankston is often mentioned alongside Colton, because they were engaged. However, Caleb appears to be generally well liked -- especially in relation to the characters he is mentioned with.
get_example_comments('Caleb Bankston Blood Is Thicker than Anything', polarity_var='overall_polarity',
n=4, polarity=-1)
get_example_comments('Caleb Bankston Blood Is Thicker than Anything', polarity_var='overall_polarity',
n=4, polarity=1)
get_example_comments('Kat Edorsson Thanks for the Souvenir', polarity_var='overall_polarity',
n=4, polarity=-1)
get_example_comments('Kat Edorsson Thanks for the Souvenir', polarity_var='overall_polarity',
n=4, polarity=1)
plot_grouped_sentiment_extremes(reddit_df, sentiment_col='sentence_subj',
lower=.005, upper=.995, include_zeros=False)
Looking at the subjectivity of the sentences, we see some interesting results. In particular, there are some examples of popular, well-known male characters (Ozzy, Jonas, and John) receiving less subjective evaluations than some female players. We investigate this in the next graphs, and it appears there is a slight increase in subjectivity at the extremes (the players with the highest/lowest subjectivity scores), but not in general for female vs. male contestants.
It is also interesting that Ozzy Lusth, Jonathan Penner, and Russell Swan have the lowest subjectivity scores. To me, this seems to indicate that they are being judged on their merits in the game. All three of these players have some level of strategic gameplay (or, in Ozzy's case, excellence at challenges) that makes them seem like objectively good Survivor players. At the same time, they are controversial and may not be the most socially well-liked. This isn't a thorough investigation, and we have somewhat glossed over subjectivity so far (and will not dive deeper), but it may be an interesting area for further analysis.
Conversely, at the bottom of the list, you see players like Malcolm Freberg and Denise Stapley. I'm focusing on these two because they're the most immediately familiar to me, but they are characters who tend to be well liked, and were underdogs (their tribes lost a lot at the start of the game). Perhaps this has led to their being ranked higher here. (Also of note, of course, are the episodes these occur in. For Malcolm, it was actually an episode of Caramoan, a returning-players season. Malcolm was well liked previously, and was perhaps one of the few returning "favorites" who was actually liked. Perhaps this made people less objective about him?)
Of course, these are all hypotheses and would require some further research to investigate, but these are interesting results nevertheless.
Breakdowns by Gender¶
After the subjectivity analysis, as well as some reading I have done concerning attitudes towards different genders of Survivor players, I wanted to take a look at the subjectivity and polarity based on the contestant's sex. The following graphs only consider the most (and least) subjective players for each sex.
The results are inconclusive -- it does not appear, based on this information alone, that the distributions are substantially different from one another. This statement is supported by the simple linear regression model later, which investigates what is predictive of the sentence sentiment, where Sex is removed as an insignificant feature.
plot_grouped_sentiment_extremes(reddit_df, plot_group='sex', sentiment_col='sentence_subj',
lower=.05, upper=.95, include_zeros=False)
plot_grouped_sentiment_extremes(reddit_df, plot_group='sex', sentiment_col='sentence_polarity',
lower=.05, upper=.95, include_zeros=False)
Now that we've dug into the sentiment analysis a bit, the next step will be to build a linear model for sentiment and see which variables are significant. In the [next post], we will fit a linear model to the sentence_polarity variable and take a look at the coefficients to determine statistically significant effects. This will be an inferential model, in the sense that we are not aiming for predictive accuracy but to explain changes in the dependent variable.